# 04. TD Control: Sarsa

Monte Carlo (MC) control methods require us to complete an entire episode of interaction before updating the Q-table. Temporal-Difference (TD) methods instead update the Q-table after every time step.

## Video: TD Control Sarsa Part 1

Watch the next video to learn about Sarsa (or Sarsa(0)), one method for TD control.

## Video: TD Control Sarsa Part 2

## Pseudocode

In the algorithm, the number of episodes the agent collects is equal to `num_episodes`. For every time step t \geq 0, the agent:

  • takes the action A_t (from the current state S_t) that is \epsilon-greedy with respect to the Q-table,
  • receives the reward R_{t+1} and next state S_{t+1},
  • chooses the next action A_{t+1} (from the next state S_{t+1}) that is \epsilon-greedy with respect to the Q-table,
  • uses the information in the tuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) to update the entry Q(S_t, A_t) in the Q-table corresponding to the current state S_t and the action A_t; the update rule is shown below.
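
The update applied in the final step is the one-step Sarsa update, where \alpha is the step-size parameter and \gamma is the discount rate:

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right)

The name Sarsa comes from the tuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) that the update uses.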
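
As a concrete sketch (not part of the lesson itself), the loop below implements the steps above in Python. It assumes a Gym-style environment with hashable states and a discrete action space, using the classic `env.reset()` / `env.step()` API that returns `(next_state, reward, done, info)`; the function and helper names are illustrative.

```python
import numpy as np
from collections import defaultdict


def epsilon_greedy_action(Q, state, n_actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if np.random.random() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))


def sarsa(env, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular Sarsa(0). Assumes a Gym-style env with a discrete action
    space and hashable states: env.reset() -> state,
    env.step(a) -> (next_state, reward, done, info)."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))

    for _ in range(num_episodes):
        state = env.reset()
        # Choose A_t epsilon-greedily from S_t.
        action = epsilon_greedy_action(Q, state, n_actions, epsilon)
        done = False
        while not done:
            # Take A_t, observe R_{t+1} and S_{t+1}.
            next_state, reward, done, _ = env.step(action)
            # Choose A_{t+1} epsilon-greedily from S_{t+1}.
            next_action = epsilon_greedy_action(Q, next_state, n_actions, epsilon)
            # One-step Sarsa update; a terminal state contributes no future value.
            target = reward + (0.0 if done else gamma * Q[next_state][next_action])
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action
    return Q
```

Note that, unlike MC control, the update happens inside the episode loop, after every `env.step()` call, which is exactly the MC-versus-TD distinction described at the top of this page.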